What is A/B testing?

“A/B testing (also known as bucket tests or split-run testing) is a randomized experiment with two variants, A and B. It includes application of statistical hypothesis testing or ‘two-sample hypothesis testing’ as used in the field of statistics. A/B testing is a way to compare two versions of a single variable, typically by testing a subject’s response to variant A against variant B, and determining which of the two variants is more effective.”

Source: Wikipedia


Randomized controlled experiments are widely used by pharmaceutical companies, medical researchers, and agricultural scientists, among others.

Example 1

The basic design of a randomised controlled trial (RCT), illustrated with a test of a new back-to-work programme (Haynes et al., 2012, p. 4)



In summary, “A/B testing can be a randomized controlled experiment, assuming you’ve controlled factors and randomized subjects, but not all randomized controlled experiments are A/B tests.”

Why is A/B testing an important tool?

A/B testing is used to determine the effects of digital marketing efforts, especially because in this industry small changes can have big effects.


How to use A/B testing?

To run an A/B test, you need to:

  • create two different versions of one piece of content
  • show these two versions to two similarly sized groups
  • analyze which one performed better over a specific period of time (long enough to draw accurate conclusions from your results)
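The steps above can be sketched in base R with simulated data. The 20% and 25% conversion rates below are assumptions for illustration only, not values from any real experiment:

```r
# Sketch of the A/B flow with simulated (hypothetical) data
set.seed(42)
n <- 1000                          # visitors shown each version

clicks_a <- rbinom(n, 1, 0.20)     # version A: assumed 20% conversion rate
clicks_b <- rbinom(n, 1, 0.25)     # version B: assumed 25% conversion rate

# Step 3: compare the observed conversion rates of the two groups
mean(clicks_a)
mean(clicks_b)

# ...and test whether the observed difference is statistically significant
prop.test(c(sum(clicks_a), sum(clicks_b)), c(n, n))
```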


Examples of A/B testing uses

  • change in a single headline
  • redesign of the webpage or app screen
  • color of a Call to Action button
  • the wording of a headline
  • the distribution of text on the page
  • the menu layouts

The A/B testing process

The following describes the basic flow of a scientific, step-by-step process that you can use for A/B split testing.

  • Identify the problem
  • Get insights into your target audience
  • Formulate a hypothesis
  • Calculate the required number of observations (visitors per day)
  • Test your hypothesis
  • Measure the impact of your variations with measurable metrics
  • Analyze the data
  • If no significant changes or improvements were found, you may need to run additional tests
  • Report the results

A/B testing errors

In hypothesis testing there are three possible outcomes of the test:

  • No error
  • Type I error
  • Type II error

With no error, everything is clear.

A Type I error (beware! this is a really serious error) occurs when you incorrectly reject the null hypothesis, concluding that there is a difference between the original page and the variation when there really isn’t. In other words, you obtain a false positive test result: as the name indicates, you think one of your test challengers is a winner when in reality it is not.

A Type II error occurs when you fail to reject the null hypothesis when you should, obtaining a false negative test result. It happens when you conclude the test assuming that none of the variations beat the original page when in reality one of them actually did.

Type I and type II errors cannot happen at the same time:

  • a Type I error can happen only when the null hypothesis is true
  • a Type II error can happen only when the null hypothesis is false

Keep in mind that statistical errors are unavoidable.

However, the better you can quantify them, the more accurate your results will be.

When conducting hypothesis testing, you cannot prove anything with 100% certainty, but you can obtain statistically significant results.
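To see how Type I errors behave in practice, we can simulate many A/A tests in base R (hypothetical data, both “variants” sharing the same true rate) and check that the false positive rate lands near alpha:

```r
# A/A simulation: with no real difference, every significant result
# is a false positive (Type I error)
set.seed(123)
alpha <- 0.05

p_values <- replicate(2000, {
  a <- rbinom(500, 1, 0.2)   # both groups have the same true 20% rate
  b <- rbinom(500, 1, 0.2)
  prop.test(c(sum(a), sum(b)), c(500, 500))$p.value
})

# Proportion of false positives: roughly alpha (the continuity
# correction makes prop.test slightly conservative)
mean(p_values < alpha)
```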

Source: A/B Testing Statistics Made Simple


A/B testing in R

Now that we’ve learned what A/B testing is, let’s do some A/B testing in R.

About the dataset

For this exercise, we are going to use the material and the ‘click_data.csv’ dataset used by DataCamp in the free chapter of the A/B Testing in R course.

The dataset is a simulated example from a cat adoption website. We will investigate whether changing the homepage image affects the conversion rate (the percentage of people who click a specific button).
# Load the libraries
library(tidyverse)
library(data.table)

# Read in data
click_data <- fread('https://assets.datacamp.com/production/repositories/2292/datasets/4407050e9b8216249a6d5ff22fd67fd4c44e7301/click_data.csv')
click_data
##       visit_date clicked_adopt_today
##    1: 2017-01-01                   1
##    2: 2017-01-02                   1
##    3: 2017-01-03                   0
##    4: 2017-01-04                   1
##    5: 2017-01-05                   1
##   ---                               
## 3646: 2017-12-27                   1
## 3647: 2017-12-28                   0
## 3648: 2017-12-29                   0
## 3649: 2017-12-30                   1
## 3650: 2017-12-31                   0

Let’s find the oldest and most recent dates:

min(click_data$visit_date)
## [1] "2017-01-01"
max(click_data$visit_date)
## [1] "2017-12-31"

Now that we know we have one year of data, let’s determine our baseline conversion rates.

Baseline conversion rates

What does ‘more’ mean in this context?

Compared to the conversion rate from when?

Calculate the mean conversion rate by month

library(lubridate)
click_data$month <- month(click_data$visit_date)

click_data_month <- click_data %>%
  group_by(month) %>%
  summarize(conversion_rate = mean(clicked_adopt_today))
click_data_month
## # A tibble: 12 x 2
##    month conversion_rate
##    <dbl>           <dbl>
##  1     1           0.197
##  2     2           0.189
##  3     3           0.145
##  4     4           0.15 
##  5     5           0.258
##  6     6           0.333
##  7     7           0.348
##  8     8           0.542
##  9     9           0.293
## 10    10           0.161
## 11    11           0.233
## 12    12           0.465
library(scales)
ggplot(click_data_month, aes(x=month, y=conversion_rate)) +
  geom_point() + 
  geom_line() +
  scale_y_continuous(labels = scales::percent, limits=c(0,1)) 

Calculate the mean conversion rate by day of the week

click_data$wday <- wday(click_data$visit_date)

click_data_wday <- click_data %>%
  group_by(wday) %>%
  summarize(conversion_rate = mean(clicked_adopt_today))
click_data_wday
## # A tibble: 7 x 2
##    wday conversion_rate
##   <dbl>           <dbl>
## 1     1           0.3  
## 2     2           0.277
## 3     3           0.271
## 4     4           0.298
## 5     5           0.271
## 6     6           0.267
## 7     7           0.256
ggplot(click_data_wday, aes(x=wday, y=conversion_rate)) +
  geom_point() + 
  geom_line()+
  scale_y_continuous(labels = scales::percent, limits=c(0,1))

Calculate the mean conversion rate by week of the year

click_data$week <- week(click_data$visit_date)

click_data_week <- click_data %>%
  group_by(week) %>%
  summarize(conversion_rate = mean(clicked_adopt_today))
click_data_week
## # A tibble: 53 x 2
##     week conversion_rate
##    <dbl>           <dbl>
##  1     1           0.229
##  2     2           0.243
##  3     3           0.171
##  4     4           0.129
##  5     5           0.157
##  6     6           0.186
##  7     7           0.257
##  8     8           0.171
##  9     9           0.186
## 10    10           0.2  
## # ... with 43 more rows
ggplot(click_data_week, aes(x=week, y=conversion_rate)) +
  geom_point() + 
  geom_line() +
  scale_y_continuous(labels = scales::percent, limits=c(0,1))

Experimental design, power analysis

Based on the previous data analysis, we have our baseline numbers and can determine how long we should run our experiment.

But before starting the experiment, we should ask some important questions:

  • How long should we run the experiment?
  • How many data points (sample size) should we get?
  • How many website hits do we get per day?
  • What statistical test should I use?
  • What is the expected value for the test condition?

Besides answering these questions, it is important to determine:

  • The proportion of the data assigned to each condition
  • The statistical significance threshold (alpha): the level at which an effect is considered significant (generally 0.05)
  • The power (1 − beta): the probability of correctly rejecting the null hypothesis (generally 0.8)

Power analysis in R

Now, let’s calculate the sample size using the ‘powerMediation’ package.

Suppose we run the experiment starting in January: we expect roughly a 20% conversion rate (p1), and let’s assume the test condition is expected to reach a 30% conversion rate (p2).

For the sample proportion (B), the significance level (alpha), and the power, we will use the most common values (0.5, 0.05, and 0.8, respectively).

library(powerMediation)

total_sample_size <- SSizeLogisticBin(p1 = 0.2, 
                 p2 = 0.3,
                 B = 0.5, 
                 alpha = 0.05,
                 power = 0.8)
total_sample_size
## [1] 587
total_sample_size/2
## [1] 293.5
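As a rough cross-check, base R gives a similar per-group size. Note the assumption here: power.prop.test uses a simple two-proportion formula, not the logistic-regression approach that powerMediation uses, so the numbers agree only approximately:

```r
# Cross-check with base R's two-proportion power calculation
res <- power.prop.test(p1 = 0.2, p2 = 0.3, sig.level = 0.05, power = 0.8)
res$n   # required sample size per group, comparable to 587 / 2
```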

Now it’s your turn.

  • How many data points do you need in total across both conditions if we decide to run the experiment in August? Make sure you round the percentages!

  • Let’s say you’ve reconsidered your expectations for running the experiment in August, because increasing the conversion rate by 10 percentage points may be difficult. Rerun your power analysis assuming only a 5-percentage-point increase in the conversion rate for the test condition.

Now, we are going to use the example presented in the article Tips for A/B Testing with R. We will test the difference between two rates in R, e.g., click-through rates or conversion rates from two tested conditions.

library(readr) 

# Specify file path: 
dataPath <-   "https://www.inwt-statistics.de/files/INWT/downloads/exampleDataABtest.csv" 

# Read data 
data <- read_csv(file = dataPath)  

head(data)
## # A tibble: 6 x 3
##   group time                clickedTrue
##   <chr> <dttm>                    <dbl>
## 1 A     2016-06-02 02:17:53           0
## 2 A     2016-06-02 03:03:54           0
## 3 A     2016-06-02 03:18:56           1
## 4 B     2016-06-02 03:23:43           0
## 5 A     2016-06-02 04:04:00           0
## 6 A     2016-06-02 04:34:53           0
# Inspect structure of the data 
str(data)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of  3 variables:
##  $ group      : chr  "A" "A" "A" "B" ...
##  $ time       : POSIXct, format: "2016-06-02 02:17:53" "2016-06-02 03:03:54" ...
##  $ clickedTrue: num  0 0 1 0 0 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   group = col_character(),
##   ..   time = col_datetime(format = ""),
##   ..   clickedTrue = col_double()
##   .. )
# Change type of group to factor  
data$group <- as.factor(data$group)  

# Change type of click through variable to factor
data$clickedTrue <- as.factor(data$clickedTrue) 
levels(data$clickedTrue) <- c("0", "1") 
str(data)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of  3 variables:
##  $ group      : Factor w/ 2 levels "A","B": 1 1 1 2 1 1 2 2 2 1 ...
##  $ time       : POSIXct, format: "2016-06-02 02:17:53" "2016-06-02 03:03:54" ...
##  $ clickedTrue: Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   group = col_character(),
##   ..   time = col_datetime(format = ""),
##   ..   clickedTrue = col_double()
##   .. )

Let’s find the oldest and most recent dates:

min(data$time)
## [1] "2016-06-02 02:17:53 UTC"
max(data$time)
## [1] "2016-06-10 01:11:15 UTC"

To test the difference between two proportions, you can use Pearson’s chi-squared test. For small samples, you should use Fisher’s exact test instead.

The prop.test function returns a p-value and a confidence interval for the difference between the two rates.

# Compute frequencies and conduct test for proportions  
# (Frequency table with successes in the first column) 
freqTable <- table(data$group, data$clickedTrue)[, c(2,1)]  

# print frequency table 
freqTable  
##    
##       1   0
##   A  20 480
##   B  40 460
# Conduct significance test 
prop.test(freqTable, conf.level = .95) 
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  freqTable
## X-squared = 6.4007, df = 1, p-value = 0.01141
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.071334055 -0.008665945
## sample estimates:
## prop 1 prop 2 
##   0.04   0.08

Based on the test result, at the 5% significance level we reject the null hypothesis (p-value = 0.01141), which means there is statistical evidence that the conversion rate of condition A differs from that of the tested condition (B).
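For smaller samples, Fisher’s exact test mentioned earlier can be applied to the same counts. The 2x2 table below is a sketch reconstructed from the printed frequencies above:

```r
# Fisher's exact test on the same 2x2 table (counts copied from the
# frequency table printed above)
freqTable <- matrix(c(20, 480,
                      40, 460),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(c("A", "B"), c("1", "0")))
fisher.test(freqTable)
```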

  • What else can we say about the results?
  • How to interpret the confidence interval values?
  • What is the power of this experiment? Is it greater than 80%?
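One way to approach the last question is a post-hoc power calculation with base R (a sketch, assuming the simple two-proportion formula rather than a logistic-regression approach): plug in the observed rates and 500 visitors per group.

```r
# Post-hoc power for the observed 4% vs 8% rates, 500 visitors per group
pw <- power.prop.test(n = 500, p1 = 0.04, p2 = 0.08, sig.level = 0.05)
pw$power
```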

Congratulations everyone!!!

I am so proud of you!